Abstract — Learning algorithms are essential for the
applications of game theory in a networking environment. In dynamic
and decentralized settings where the traffic, topology and 
channel states may vary over time and the communication 
between agents is impractical, it is important to formulate and 
study games of incomplete information and fully distributed 
learning algorithms which for each agent requires a minimal 
amount of information regarding the remaining agents. In 
this paper, we address this major challenge and introduce 
heterogeneous learning schemes in which each agent adopts a 
distinct learning pattern in the context of games with incomplete 
information. We use stochastic approximation techniques to 
show that the heterogeneous learning schemes can be studied 
in terms of their deterministic ordinary differential equation 
(ODE) counterparts. Depending on the learning rates of the 
players, these ODEs could be different from the standard 
replicator dynamics, (myopic) best response (BR) dynamics, 
logit dynamics, and fictitious play dynamics. We apply the 
results to a class of security games in which the attacker and the 
defender adopt different learning schemes due to differences in 
their rationality levels and the information they acquire. 

I. Introduction 

Distributed iterative schemes play an important role in the 
computation of equilibria and the estimation of payoffs under 
incomplete information [2]. This paper studies a two-person 
zero-sum stochastic game with an arbitrary number of states 
and a finite number of actions for each player. When each
player has complete knowledge of its payoff function and
has access to the past actions of the others, then there is
an arsenal of tools such as fictitious play algorithms, best 
response dynamics, and gradient-based algorithms, that can 
be used to arrive at the equilibrium of the game. However, 
it is well known that these algorithms may fail to converge 
even under the perfect observation of actions and payoffs [3], 
[5], [10], [11]. A new learning challenge hence arises when 
a player does not know its own payoff function and/or has 
no information about the past actions of the other players. In 
this case, the player needs to interact with the environment 
to find out its expected payoff and its optimal strategy. 

In practical applications, we are often in search of dis- 
tributed learning algorithms that require a minimal amount 
of information and a minimal amount of resources. It is 
then natural to ask whether there exists a learning scheme 
that demands less information and less memory within a 
dynamically evolving environment, and leads to an efficient, 
stable and fair outcome. In this paper, we address this 
challenge by proposing a class of heterogeneous learning
algorithms in a scenario where the players do not know their
own payoff functions. At each time t, each player chooses 
an action and receives a numerical value for its payoff or 
perceived payoff as an outcome of the instantaneous game. In 
contrast to fictitious play and best response dynamics which 
require the knowledge of the history of actions played by the 
other players, our learning algorithm relaxes this assumption. 
Indeed, it is often implausible and impractical in applications 
to assume the capability of observations of the actions of the 
other players. Furthermore, we assume that the state space 
of the game and its transition law between the states are 
unknown to the players. In addition, the players also do not 
have the knowledge of the action spaces of the others. The
question we address is how much the players can expect
to learn under such circumstances.

We propose different coupled (or combined) and fully 
distributed learning schemes that enable learning optimal 
strategies and concurrently estimating the optimal payoffs. 
In contrast to the standard reinforcement learning algorithms,
which focus only on either strategy or payoff reinforcement
for the equilibrium learning, the algorithm that couples
payoff-reinforcement learning with strategy-reinforcement
learning enables an immediate prediction and updates the
strategies using estimates refreshed by recent
experiences. Our learning algorithms also offer the degrees of
freedom to model different levels of rationality and learning 
rates of the players. The ordinary differential equations 
(ODEs) associated with the stochastic learning algorithms 
differ from the standard replicator dynamics, best response 
dynamics and fictitious play dynamics. Particular connec- 
tions to logit dynamics and imitative logit dynamics are also 
established. Using basic stochastic approximation techniques 
from [3], [6], [9], [10] and under suitable assumptions on the 
learning rates, we show their convergence to a new class 
of game dynamics and asymptotic properties of different 
learning algorithms within a class of zero-sum stochastic 
games. 

The paper is structured as follows. In the next section, we
present the zero-sum stochastic game model and provide an
overview of the basic properties of reinforcement learning
algorithms. Section III presents our main results on
heterogeneous learning algorithms. In Section IV, we apply the
learning algorithms to study security games and provide
numerical results. Section V concludes the paper and discusses
future work.

II. Game Model and Learning Algorithms 

In this section, we formulate a two-person zero-sum
stochastic game model $(\mathcal{S}, \mathcal{A}_1, \mathcal{A}_2, \{U(s,\cdot)\}_{s \in \mathcal{S}})$, where
$\mathcal{A}_1, \mathcal{A}_2$ are the finite sets of actions available to players P1
and P2, respectively, and $\mathcal{S}$ is the set of possible states. We
assume that the state space $\mathcal{S}$ and the probability distribution
on the states are both unknown to the players. A state $s \in \mathcal{S}$
is an independent and identically distributed random variable
defined on the set $\mathcal{S}$. We assume the action spaces are the
same in each state. The zero-sum game is characterized by
a single utility function $U : \mathcal{S} \times \mathcal{A}_1 \times \mathcal{A}_2 \to \mathbb{R}$. P1 collects
a payoff $U_1(s, a_1, a_2) = U(s, a_1, a_2)$ when he chooses
$a_1 \in \mathcal{A}_1$ and P2 uses $a_2 \in \mathcal{A}_2$ at state $s \in \mathcal{S}$, and for
the same choices P2 collects a payoff of
$U_2(s, a_1, a_2) = c - U(s, a_1, a_2)$; equivalently, $U(s, a_1, a_2) - c$ is the cost to P2,
where $c$ is a constant. In terms of the single utility function
$U$, P1 is the maximizer and P2 is the minimizer, and both
players are interested in the performance at steady state using
mixed strategies, as will be made clear shortly. The preceding
game model can be viewed as a special class of stochastic
games in which the state transitions are independent of the
player actions as well as the current state. Note that what we
have here is a constant-sum game, where the constant is $c$. In
the analysis of its equilibrium, we can let $c = 0$ without any
loss of generality, and hence view it as a zero-sum game.

Time is slotted, $t \in \{0, 1, \ldots\}$, and in each slot the players pick
their mixed strategies as functions of what has transpired
in the past, to the extent the information available to them
allows. Toward this end, we let $f_t(a_1)$ and $g_t(a_2)$ denote
the probabilities of P1 choosing $a_1 \in \mathcal{A}_1$ and P2 choosing
$a_2 \in \mathcal{A}_2$, respectively, at time $t$, and let $\mathbf{f}_t = [f_t(a_1)]_{a_1 \in \mathcal{A}_1}$
and $\mathbf{g}_t = [g_t(a_2)]_{a_2 \in \mathcal{A}_2}$ be the mixed strategies of P1 and
P2, respectively (at time $t$), where more precisely

$$\mathbf{f}_t \in \mathcal{F} := \Big\{ \mathbf{f} : f(a_1) \in [0, 1],\ \sum_{a_1 \in \mathcal{A}_1} f(a_1) = 1 \Big\}; \qquad (1)$$

$$\mathbf{g}_t \in \mathcal{G} := \Big\{ \mathbf{g} : g(a_2) \in [0, 1],\ \sum_{a_2 \in \mathcal{A}_2} g(a_2) = 1 \Big\}. \qquad (2)$$

In particular, we define $\mathbf{e}_{a_1}, \mathbf{e}_{a_2}$, with $a_1 \in \mathcal{A}_1$, $a_2 \in \mathcal{A}_2$,
as unit vectors of sizes $|\mathcal{A}_1|$ and $|\mathcal{A}_2|$, respectively, whose
entry that corresponds to $a_1$ or $a_2$ is 1 while the others are
zeros. We assume that the mixed strategies of the players
are independent of the current state $s$. For any given pair of
mixed strategies, $(\mathbf{f} \in \mathcal{F}, \mathbf{g} \in \mathcal{G})$, and for a fixed $s \in \mathcal{S}$,
we define the expected utility (as expected payoff to P1
and expected cost to P2) as $\mathbb{U}(s, \mathbf{f}, \mathbf{g}) := E_{\mathbf{f},\mathbf{g}} U(s, a_1, a_2)$,
where $E_{\mathbf{f},\mathbf{g}}$ denotes expectation of $U$ over the action sets
of the players under the given mixed strategies. A further
expectation of this quantity over the states $s$, denoted $E_s$,
yields the performance index of the expected game. We now
define the equilibrium concept of interest for this game, that
is, the saddle-point equilibrium:

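Definition II-A (Saddle Point): A strategy pair $(\mathbf{f}^*, \mathbf{g}^*)$
constitutes a saddle point for the expected game if and only
if, for all $\mathbf{f} \in \mathcal{F}$ and $\mathbf{g} \in \mathcal{G}$,

$$E_s \mathbb{U}(s, \mathbf{f}, \mathbf{g}^*) \le E_s \mathbb{U}(s, \mathbf{f}^*, \mathbf{g}^*) \le E_s \mathbb{U}(s, \mathbf{f}^*, \mathbf{g}). \qquad (3)$$

This now being a finite zero-sum game (or constant-sum
game, if $c \neq 0$), the existence of a saddle point is guaranteed
by the minimax theorem.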
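As an illustration of inequality (3), the following minimal sketch checks the saddle-point property for a 2 x 2 expected payoff matrix. The matrix `M_example` and the helper names `solve_2x2_zero_sum` and `is_saddle_point` are hypothetical and not taken from the paper; the closed-form solution assumes the game has no pure saddle point. Because the expected utility is linear in each player's mixed strategy, it is enough to test pure-strategy deviations.

```python
import numpy as np

def solve_2x2_zero_sum(M):
    """Mixed saddle point of a 2x2 matrix game that has no pure saddle point.

    Rows belong to the maximizer (P1), columns to the minimizer (P2).
    """
    (a, b), (c, d) = M
    denom = a - b - c + d
    f = np.array([d - c, a - b]) / denom   # P1's probabilities over rows
    g = np.array([d - b, a - c]) / denom   # P2's probabilities over columns
    value = (a * d - b * c) / denom
    return f, g, value

def is_saddle_point(M, f, g, tol=1e-9):
    """Check inequality (3) against all pure-strategy deviations."""
    M = np.asarray(M, dtype=float)
    value = f @ M @ g
    p1_cannot_gain = np.all(M @ g <= value + tol)   # U(f, g*) <= U(f*, g*)
    p2_cannot_gain = np.all(f @ M >= value - tol)   # U(f*, g*) <= U(f*, g)
    return bool(p1_cannot_gain and p2_cannot_gain)

# Hypothetical expected payoff matrix, chosen only for illustration.
M_example = np.array([[2.0, -1.0], [-1.0, 1.0]])
f_star, g_star, game_value = solve_2x2_zero_sum(M_example)
```

For this example the indifference conditions give both players the mixed strategy [0.4, 0.6] and a game value of 0.2, and `is_saddle_point(M_example, f_star, g_star)` holds.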



We now consider this game played over the discrete-time
horizon, with the players generating mixed strategies,
say $(\mathbf{f}_t, \mathbf{g}_t)$, at every time point $t$. These strategies will
be generated (recursively updated) according to some rule,
which uses the information available to the players. As
indicated before, the players do not know the functional
form of $U$, that is, they do not know the entries of the
underlying matrix, but at each time $t$ they observe the value
$U(s, a_{1,t}, a_{2,t})$, where the actions are realized under $(\mathbf{f}_t, \mathbf{g}_t)$,
and they recall their own past actions. With this information,
P1 and P2 generate, respectively, $\mathbf{f}_{t+1}$ and $\mathbf{g}_{t+1}$. The precise
way of doing this is determined by the algorithm picked, and
there will be several such algorithms, as will be discussed
shortly. For each one, our goal is to show that the sequences
thus generated converge to the pair of mixed saddle-point
strategies, that is, $\lim_{t \to \infty} \mathbf{f}_t = \mathbf{f}^*$, $\lim_{t \to \infty} \mathbf{g}_t = \mathbf{g}^*$, where
the limit will be given a precise meaning later.

A. Learning Schemes 

To achieve the saddle-point solution, we suggest the
following reinforcement learning mechanism for homogeneous
learners. We use the abbreviation "RL" for "reinforcement
learning" and "C" for "combined", suggesting that the
algorithm involves learning the expected utility as well as the
strategies. We consider combined fully distributed payoff
and strategy reinforcement learning (CODIPAS-RL) in the
form:

$$\begin{cases}
\mathbf{f}_{t+1} = \mathbf{f}_t + \Pi_{11}(\lambda_{1,t}, a_{1,t}, U_{1,t}, \mathbf{u}_{1,t}, \mathbf{f}_t) \\
\mathbf{u}_{1,t+1} = \mathbf{u}_{1,t} + \Pi_{12}(\mu_{1,t}, a_{1,t}, U_{1,t}, \mathbf{f}_t, \mathbf{u}_{1,t}) \\
\mathbf{g}_{t+1} = \mathbf{g}_t + \Pi_{21}(\lambda_{2,t}, a_{2,t}, U_{2,t}, \mathbf{u}_{2,t}, \mathbf{g}_t) \\
\mathbf{u}_{2,t+1} = \mathbf{u}_{2,t} + \Pi_{22}(\mu_{2,t}, a_{2,t}, U_{2,t}, \mathbf{g}_t, \mathbf{u}_{2,t}) \\
t \ge 0,\ a_{j,t} \in \mathcal{A}_j,\ j \in \{1, 2\},
\end{cases}$$

where $\Pi_{i1}, \Pi_{i2}$, $i \in \{1, 2\}$, are properly chosen functions.
The parameters $\lambda_{i,t}, \mu_{i,t}$ are learning rates indicating the players'
capabilities of information retrieval and update. The vectors
$\mathbf{f}_t \in \mathcal{F}$, $\mathbf{g}_t \in \mathcal{G}$ are the mixed strategies of the players at time
$t$; $\mathbf{u}_{i,t}$, $i \in \{1, 2\}$, are the estimated average payoffs updated
at each iteration $t$; and $U_{i,t}$, $i \in \{1, 2\}$, are the perceived
payoffs received by the players at time $t$.

We identify below five different special cases of this 
general class of learning algorithms, each one important in 
its own right. 

1) CRL0: The first COmbined fully DIstributed PAyoff
and Strategy Reinforcement Learning (CODIPAS-RL)
algorithm is CRL0, given in (4) below, which captures the
procedure in [5] for both payoffs and strategies. At every
time step $t$, P1 and P2 each choose an action according to
their estimations and their mixed strategy vectors $\mathbf{f}_t$ and $\mathbf{g}_t$,
respectively. Based on the joint action, each player perceives
his instantaneous payoff $U_{i,t}$, $i \in \{1, 2\}$, and updates his
strategy vector. The strategy and utility updates are not
coupled and do not involve optimal choices of the players.
The players make updates by taking a weighted average
of the current observed payoff and the quantities from the
previous iteration. The indicator function $\mathbb{1}_{\{a_{i,t}\}}$ is a unit
vector of appropriate dimension with one of its components,
the one corresponding to the action $a_{i,t}$ chosen at time $t$, being 1
and the others being zeros. The step size parameters $\lambda_{i,t}$
need to be small enough such that $\lambda_{i,t} U_{i,t} < 1$ for all $t$.

$$\begin{cases}
f_{t+1}(a_1) = f_t(a_1) + \lambda_{1,t} U_{1,t} \big( \mathbb{1}_{\{a_{1,t} = a_1\}} - f_t(a_1) \big) \\
u_{1,t+1}(a_1) = u_{1,t}(a_1) + \mu_{1,t} \mathbb{1}_{\{a_{1,t} = a_1\}} \big( U_{1,t} - u_{1,t}(a_1) \big),\ a_1 \in \mathcal{A}_1 \\
g_{t+1}(a_2) = g_t(a_2) + \lambda_{2,t} U_{2,t} \big( \mathbb{1}_{\{a_{2,t} = a_2\}} - g_t(a_2) \big) \\
u_{2,t+1}(a_2) = u_{2,t}(a_2) + \mu_{2,t} \mathbb{1}_{\{a_{2,t} = a_2\}} \big( U_{2,t} - u_{2,t}(a_2) \big),\ a_2 \in \mathcal{A}_2
\end{cases} \qquad (4)$$


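2) CRL1: Algorithm CRL1, given in (5) below, is another
combined algorithm that learns the average utility and the
mixed strategies concurrently. This is a Boltzmann-Gibbs
based CODIPAS-RL. In a similar fashion as in CRL0, P1
and P2 select their actions based on their current strategy
distributions. However, the updates on the strategies and the
average payoff follow reinforcement learning, and $\lambda_{i,t}$ and
$\mu_{i,t}$ are the learning rates for the strategies and the payoffs,
respectively, satisfying Assumption II-A.6 and $\lambda_{i,t}/\mu_{i,t} \to 0$, $i \in \{1, 2\}$.

$$\begin{cases}
\mathbf{f}_{t+1} = (1 - \lambda_{1,t}) \mathbf{f}_t + \lambda_{1,t} \beta_{1,\epsilon}(\mathbf{u}_{1,t}) \\
u_{1,t+1}(a_1) = u_{1,t}(a_1) + \mu_{1,t} \mathbb{1}_{\{a_{1,t} = a_1\}} \big( U_{1,t} - u_{1,t}(a_1) \big),\ a_1 \in \mathcal{A}_1 \\
\mathbf{g}_{t+1} = (1 - \lambda_{2,t}) \mathbf{g}_t + \lambda_{2,t} \beta_{2,\epsilon}(\mathbf{u}_{2,t}) \\
u_{2,t+1}(a_2) = u_{2,t}(a_2) + \mu_{2,t} \mathbb{1}_{\{a_{2,t} = a_2\}} \big( U_{2,t} - u_{2,t}(a_2) \big),\ a_2 \in \mathcal{A}_2
\end{cases} \qquad (5)$$

where $\beta_{i,\epsilon} : \mathbb{R}^{|\mathcal{A}_i|} \to \mathbb{R}^{|\mathcal{A}_i|}$, $i \in \{1, 2\}$, is the
Boltzmann-Gibbs strategy or the soft-max function parameterized by $\epsilon > 0$,
which takes in the average payoff vector and produces a
vector that assigns more weight to the maximum component.
The weight assigned to a particular action $a_i \in \mathcal{A}_i$, $i \in \{1, 2\}$,
is given by

$$\beta_{i,\epsilon}(\mathbf{u}_{i,t})(a_i) = \frac{e^{u_{i,t}(a_i)/\epsilon}}{\sum_{a_i' \in \mathcal{A}_i} e^{u_{i,t}(a_i')/\epsilon}},\ a_i \in \mathcal{A}_i,\ i \in \{1, 2\}. \qquad (6)$$

It is clear that when $\epsilon$ is high, the output of the $\beta_{i,\epsilon}$ function
does not distinguish among the actions and assigns equal
weights to them; when $\epsilon$ approaches zero, the $\beta_{i,\epsilon}$ function bears
more resemblance to the maximum function, assigning 1
to the action yielding the maximum average payoff and zeros
to the other actions [4].

3) CRL2: The procedure for the CODIPAS-RL algorithm
CRL2 is similar to CRL1 and differs only in the soft-max
function used. In place of the Boltzmann-Gibbs strategy, we
adopt the imitative Boltzmann-Gibbs strategy, which is weighted
by the current strategy vector [7], and is given by
$\sigma_i : \mathbb{R}^{|\mathcal{A}_i|} \times \mathbb{R}^{|\mathcal{A}_i|} \to \mathbb{R}^{|\mathcal{A}_i|}$, $i \in \{1, 2\}$.
The component-wise mapping for P1 is expressed by

$$\sigma_1(\mathbf{f}_t, \mathbf{u}_{1,t})(a_1) = \frac{f_t(a_1)\, e^{u_{1,t}(a_1)/\epsilon}}{\sum_{a_1' \in \mathcal{A}_1} f_t(a_1')\, e^{u_{1,t}(a_1')/\epsilon}}. \qquad (7)$$

Likewise, for P2, we have

$$\sigma_2(\mathbf{g}_t, \mathbf{u}_{2,t})(a_2) = \frac{g_t(a_2)\, e^{u_{2,t}(a_2)/\epsilon}}{\sum_{a_2' \in \mathcal{A}_2} g_t(a_2')\, e^{u_{2,t}(a_2')/\epsilon}}. \qquad (8)$$

Collecting all this, the CRL2 algorithm is then as given
below:

$$\begin{cases}
\mathbf{f}_{t+1} = (1 - \lambda_{1,t}) \mathbf{f}_t + \lambda_{1,t} \sigma_1(\mathbf{f}_t, \mathbf{u}_{1,t}) \\
u_{1,t+1}(a_1) = u_{1,t}(a_1) + \mu_{1,t} \mathbb{1}_{\{a_{1,t} = a_1\}} \big( U_{1,t} - u_{1,t}(a_1) \big) \\
\mathbf{g}_{t+1} = (1 - \lambda_{2,t}) \mathbf{g}_t + \lambda_{2,t} \sigma_2(\mathbf{g}_t, \mathbf{u}_{2,t}) \\
u_{2,t+1}(a_2) = u_{2,t}(a_2) + \mu_{2,t} \mathbb{1}_{\{a_{2,t} = a_2\}} \big( U_{2,t} - u_{2,t}(a_2) \big)
\end{cases} \qquad (9)$$

4) RL2: The learning algorithm (10) updates the strategies
simultaneously [1], [5].

$$\begin{cases}
\mathbf{f}_{t+1} = \mathbf{f}_t + \lambda_{1,t} U_{1,t} \big( \mathbb{1}_{\{a_{1,t} = a_1\}} - \mathbf{f}_t \big) \\
\mathbf{g}_{t+1} = \mathbf{g}_t + \lambda_{2,t} U_{2,t} \big( \mathbb{1}_{\{a_{2,t} = a_2\}} - \mathbf{g}_t \big)
\end{cases} \qquad (10)$$

5) RL3: In RL3, we normalize the RL2 update (10) by constants $n$
and $C$. This algorithm has appeared in [1] and is labeled (11).

The following assumption on learning rates is adopted for
all the above listed learning schemes.

Assumption II-A.6: The learning rates $\lambda_{i,t}, \mu_{i,t}$, $i \in \{1, 2\}$,
satisfy the following conditions:

$$\sum_{t \ge 1} \lambda_{i,t} = \sum_{t \ge 1} \mu_{i,t} = +\infty, \qquad \sum_{t \ge 1} \lambda_{i,t}^2 < +\infty,\ \sum_{t \ge 1} \mu_{i,t}^2 < +\infty,\ i \in \{1, 2\}. \qquad (13)$$

The learning rate which perhaps has the simplest form that
satisfies the conditions of Assumption II-A.6 is the harmonic
sequence, i.e., (R1) $\mu_{i,t} = \frac{1}{t+1}$. To study learning on
different time scales, we need to consider other learning
rates. Typical learning rates are
(R2) $\mu_{i,t} = \frac{1}{\sqrt{t+1}\,\log_2(t+1)}$,
(R3) $\mu_{i,t} = \frac{1}{(t + c')^{\alpha}}$, $\frac{1}{2} < \alpha \le 1$, $c' > 0$,
(R4) $\mu_{i,t} = \frac{\rho_i}{(t+1)\log(t+1)}$, $0 < \rho_i$.
It is clear that the learning rate (R1) is faster than
(R2) and (R3). In addition, by scaling $\rho_i$ in (R4), we can
obtain learning rates on different time scales.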
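The Boltzmann-Gibbs mapping, its imitative variant, and the convex-combination strategy update used by CRL1 and CRL2 can be sketched as below. This is a minimal illustration under the assumption that the soft-max weights are proportional to the exponential of the estimated payoff divided by the temperature; the function names are hypothetical, and subtracting the maximum before exponentiating is a numerical-stability choice that does not change the result.

```python
import numpy as np

def boltzmann_gibbs(u_hat, eps):
    """Soft-max of the estimated payoffs with temperature eps > 0."""
    z = (u_hat - np.max(u_hat)) / eps   # shift by the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def imitative_boltzmann_gibbs(f, u_hat, eps):
    """Soft-max weighted by the current strategy vector f."""
    z = (u_hat - np.max(u_hat)) / eps
    w = f * np.exp(z)
    return w / w.sum()

def strategy_update(f, target, lam):
    """Convex-combination update: (1 - lam) * f + lam * target."""
    return (1.0 - lam) * f + lam * target

u = np.array([1.0, 0.0])
nearly_uniform = boltzmann_gibbs(u, eps=1e6)   # large eps: actions are not distinguished
nearly_greedy = boltzmann_gibbs(u, eps=0.01)   # small eps: mass on the best action
```

Two properties are worth noting. With a uniform strategy vector the imitative variant coincides with the plain soft-max, and an action that currently has probability zero stays at zero under the imitative variant, whereas the plain soft-max always returns a strictly positive vector.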



B. Basic properties 

1) Properties of RL2, RL3 and CRL0: The algorithm
RL2 has been studied by Börgers and Sarin in [5]. The
algorithm RL3 is a normalized version of RL2. This version
has been studied by Arthur in [1]. These authors have
shown that RL2 follows a pseudo-trajectory of the replicator
dynamics when the learning rate $\lambda_{i,t}$ goes to zero. Similarly,
the reinforcement learning RL3 follows a trajectory of an
adjusted version of the replicator equation.

The learning algorithm CRL0 is obtained by combining
these strategy reinforcement learnings with a payoff
reinforcement learning (Q-learning). Q-learning is known
to converge to the expected payoffs if all the actions
are sufficiently used and the learning parameters satisfy
the standard conditions. The combination of these two
approaches gives a new learning algorithm called combined
fully distributed payoff and strategy reinforcement learning
(CODIPAS-RL). With this new algorithm, the players will be
able to learn both the expected payoffs and the associated
optimal strategies, i.e., if $(\mathbf{f}_t, \mathbf{u}_{1,t}, \mathbf{g}_t, \mathbf{u}_{2,t}) \to (\mathbf{f}^*, \mathbf{u}_1^*, \mathbf{g}^*, \mathbf{u}_2^*)$,
then $(\mathbf{f}^*, \mathbf{g}^*)$ is a saddle point of the expected game and
$E_s \mathbb{U}(s, \mathbf{f}^*, \mathbf{g}^*) = u_1^* = c - u_2^*$. Moreover, the strategies are
generated by the replicator equation:

$$\begin{cases}
\dot{f}_t(a_1) = f_t(a_1) \Big[ u_1(\mathbf{e}_{a_1}, \mathbf{g}_t) - \sum_{a_1' \in \mathcal{A}_1} u_1(\mathbf{e}_{a_1'}, \mathbf{g}_t) f_t(a_1') \Big] \\
\dot{g}_t(a_2) = g_t(a_2) \Big[ u_2(\mathbf{f}_t, \mathbf{e}_{a_2}) - \sum_{a_2' \in \mathcal{A}_2} u_2(\mathbf{f}_t, \mathbf{e}_{a_2'}) g_t(a_2') \Big]
\end{cases}$$

where $u_1(\mathbf{f}, \mathbf{g}) = E_s \mathbb{U}(s, \mathbf{f}, \mathbf{g})$ and $u_2(\cdot) = c - u_1(\cdot)$.

A major inconvenience with CODIPAS-RL CRL0, RL2
and RL3 is that the rest points (equilibrium states) of the
corresponding ODEs are not necessarily equilibria of the
expected game. For example, all the faces of the simplex are
forward invariant (when started on one face, the trajectory
of the replicator dynamics remains on that face). As is well
known, the game may not have an equilibrium on that face.
Therefore, the outcome of the replicator dynamics may not
be an equilibrium. To resolve this problem, one can fix the
starting point in the relative interior of the simplex (for
example, the uniform distribution can be chosen as the initial
point). Then, we have the following conclusions.

(S1) If started in the interior, the dominated strategies will
be eliminated.

(S2) If started in the interior, and if the trajectory goes to
the boundary, then the outcome is an equilibrium.

(S3) If there is a cyclic orbit of the dynamics, the limit cycle
contains an equilibrium in its interior.

(S4) The expected payoff is learned if CODIPAS-RL CRL0
is used; $f(a_1) > 0$ implies that $u_{1,t}(a_1) \to E_s \mathbb{U}(s, \mathbf{e}_{a_1}, \mathbf{g})$,
and similarly for P2, $g(a_2) > 0$ implies
that $u_{2,t}(a_2) \to c - E_s \mathbb{U}(s, \mathbf{f}, \mathbf{e}_{a_2})$.

Another way of eliminating the non-equilibrium rest points
is to perturb the game. The strategy can be perturbed using a
small deviation from $(\mathbf{f}, \mathbf{g})$, i.e., an action $a_1$ will be chosen
with probability $(1 - \epsilon) f(a_1) + \frac{\epsilon}{|\mathcal{A}_1|}$.

2) Properties of CRL1 and CRL2: Numerically, the
approximation of CRL0, RL2 and RL3 can lead to the boundary
of the simplex. To solve this problem, we propose a
modified version of CODIPAS-RL based on the Boltzmann-Gibbs
distribution. These are the coupled reinforcement learning
algorithms CRL1 and CRL2. Since the Boltzmann-Gibbs distribution
never vanishes, the new algorithm CODIPAS-RL CRL1
based on Boltzmann-Gibbs is well defined for any initial
condition and preserves the property that every rest point is a
Boltzmann-Gibbs equilibrium, also called logit equilibrium,
i.e., a fixed point of the mapping $\beta_{1,\epsilon}(E_s \mathbb{U}_1(s, \cdot, \mathbf{g})) = \mathbf{f}$,
$\beta_{2,\epsilon}(E_s \mathbb{U}_2(s, \mathbf{f}, \cdot)) = \mathbf{g}$, which is an $\epsilon$-saddle-point
equilibrium. Thus, by choosing $\epsilon$ arbitrarily small, an
approximate solution is obtained. The main advantage of the
Boltzmann-Gibbs distribution is that it is a smooth mapping
(a regularized version of the best-response correspondence).



III. Main results 

In this section, we obtain ODE approximations of the 
learning algorithms in Section II and show the convergence 
of different heterogeneous learning algorithms to saddle- 
point solutions. 

A. Convergence to ODE: the combined learning algorithms 

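We first examine the case where the players learn via
different schemes but on the same time scale or by the
same learning rate, i.e., the factor $\lambda_{i,t} = \lambda_t$, $i \in \{1, 2\}$, is
independent of the players. We use $\beta_{1,\epsilon}(\mathbf{g}_t) : \Delta(\mathcal{A}_2) \to \Delta(\mathcal{A}_1)$
and $\beta_{2,\epsilon}(\mathbf{f}_t) : \Delta(\mathcal{A}_1) \to \Delta(\mathcal{A}_2)$ to denote P1's
and P2's Boltzmann-Gibbs responses to the other player's
mixed strategies, with $\beta_{1,\epsilon}(\mathbf{g}_t)(a_1) := \beta_{1,\epsilon}(u_1(\mathbf{e}_{a_1}, \mathbf{g}_t))$ and
$\beta_{2,\epsilon}(\mathbf{f}_t)(a_2) := \beta_{2,\epsilon}(u_2(\mathbf{f}_t, \mathbf{e}_{a_2}))$, $a_1 \in \mathcal{A}_1$, $a_2 \in \mathcal{A}_2$.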

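Theorem III-A.1: The combined learning algorithm with
different learners using CRL1, RL2, RL3 converges to the
joint system of ODEs. In particular, if P1 uses CRL1 and P2
adopts RL2, then the ODE is given by

$$\begin{cases}
\frac{d}{dt} u_{1,t}(a_1) = u_1(\mathbf{e}_{a_1}, \mathbf{g}_t) - u_{1,t}(a_1),\ a_1 \in \mathcal{A}_1, \\
\dot{\mathbf{f}}_t = \beta_{1,\epsilon}(\mathbf{g}_t) - \mathbf{f}_t, \\
\dot{g}_t(a_2) = g_t(a_2) \Big[ u_2(\mathbf{f}_t, \mathbf{e}_{a_2}) - \sum_{a_2' \in \mathcal{A}_2} u_2(\mathbf{f}_t, \mathbf{e}_{a_2'}) g_t(a_2') \Big],\ a_2 \in \mathcal{A}_2.
\end{cases} \qquad (14)$$

Moreover, if P2 adopts RL3 in lieu of RL2, then one has
the adjusted replicator dynamics instead of the standard
replicator equation.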
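To make the strategy part of the joint ODE concrete, the following sketch integrates a smooth best-response equation for P1 coupled with a replicator equation for P2 using a forward Euler scheme. It assumes c = 0, an expected payoff matrix `M` whose rows are P1's actions and whose columns are P2's actions, and hypothetical step size and horizon; it illustrates the shape of the dynamics and is not a reproduction of the paper's numerical experiments.

```python
import numpy as np

def softmax(v, eps):
    w = np.exp((v - v.max()) / eps)
    return w / w.sum()

def ode14_rhs(f, g, M, eps):
    """Right-hand side of the strategy equations, with c = 0."""
    u1 = M @ g                         # expected payoff of each pure action of P1
    u2 = -(f @ M)                      # expected payoff of each pure action of P2
    f_dot = softmax(u1, eps) - f       # smooth best response (P1 uses CRL1)
    g_dot = g * (u2 - g @ u2)          # replicator dynamics (P2 uses RL2)
    return f_dot, g_dot

def euler(f, g, M, eps, dt=0.01, steps=20000):
    """Forward Euler integration; returns the strategies at the final time."""
    for _ in range(steps):
        f_dot, g_dot = ode14_rhs(f, g, M, eps)
        f, g = f + dt * f_dot, g + dt * g_dot
    return f, g

# Hypothetical expected payoff matrix, used only for illustration.
M = np.array([[2.0, -1.0], [-1.0, 1.0]])
f_dot0, g_dot0 = ode14_rhs(np.array([0.5, 0.5]), np.array([0.5, 0.5]), M, eps=1.0)
```

Both right-hand sides sum to zero across actions, so the exact flow stays on the simplex; the Euler iterates inherit this up to floating-point error as long as the step size is small relative to the payoff range.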

We now have the following corollary corresponding to 
different learning rates for the two players. 

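Corollary III-A.2: In the heterogeneous learning where
players choose to adopt one learning scheme among CRL1,
RL2, RL3 and with different learning rates, we have the
following results.

(C1) If P1 uses CRL1 and P2 learns through RL2 with a
rate $k_2$ faster than P1's rate, then the ODE is given by

$$\begin{cases}
\frac{d}{dt} u_{1,t}(a_1) = u_1(\mathbf{e}_{a_1}, \mathbf{g}_t) - u_{1,t}(a_1),\ a_1 \in \mathcal{A}_1, \\
\dot{\mathbf{f}}_t = \beta_{1,\epsilon}(\mathbf{g}_t) - \mathbf{f}_t, \\
\dot{g}_t(a_2) = k_2\, g_t(a_2) \Big[ u_2(\mathbf{f}_t, \mathbf{e}_{a_2}) - \sum_{a_2' \in \mathcal{A}_2} u_2(\mathbf{f}_t, \mathbf{e}_{a_2'}) g_t(a_2') \Big],\ a_2 \in \mathcal{A}_2.
\end{cases}$$

Moreover, if P2 adopts RL3 in lieu of RL2, then one has
the $k_2$-adjusted replicator dynamics instead of the standard
replicator equation.

(C2) If P1 uses CRL1 with a rate of learning $k_1$ faster
than P2, who learns with RL2, then the ODE is given by

$$\begin{cases}
\frac{d}{dt} u_{1,t}(a_1) = u_1(\mathbf{e}_{a_1}, \mathbf{g}_t) - u_{1,t}(a_1),\ a_1 \in \mathcal{A}_1, \\
\dot{\mathbf{f}}_t = k_1 \big[ \beta_{1,\epsilon}(\mathbf{g}_t) - \mathbf{f}_t \big], \\
\dot{g}_t(a_2) = g_t(a_2) \Big[ u_2(\mathbf{f}_t, \mathbf{e}_{a_2}) - \sum_{a_2' \in \mathcal{A}_2} u_2(\mathbf{f}_t, \mathbf{e}_{a_2'}) g_t(a_2') \Big],\ a_2 \in \mathcal{A}_2.
\end{cases}$$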
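Lemma III-A.3 (Explicit Solutions of Smooth BR
Equation): Given P2's trajectory $\{\mathbf{g}_{t'}\}_{t'}$ and an initial
condition $\mathbf{f}_0$, the smooth best response equation

$$\dot{\mathbf{f}}_t = \beta_{1,\epsilon}(\mathbf{g}_t) - \mathbf{f}_t \qquad (15)$$

in (14) has a unique solution given by the vectorial function

$$\xi_1(\mathbf{g}_t)(a_1) = f_0(a_1) e^{-t} + e^{-t} \int_0^t z_{1,t'}(a_1)\, e^{t'}\, dt',\ a_1 \in \mathcal{A}_1, \qquad (16)$$

where $\mathbf{z}_{1,t'} = \beta_{1,\epsilon}(\mathbf{g}_{t'})$. In particular, if P2 is a slow learner,
i.e., $\mathbf{g}_t = \mathbf{g}$, constant in time, then the smooth best response
equation of P1 converges to

$$\xi_1(\mathbf{g})(a_1) = (1 - e^{-t}) \beta_{1,\epsilon}(\mathbf{g})(a_1) + e^{-t} f_0(a_1),\ a_1 \in \mathcal{A}_1, \qquad (17)$$

which goes to $\beta_{1,\epsilon}(\mathbf{g})$ when $t \to +\infty$.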
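The explicit solution of the smooth best response equation can be recovered by a standard integrating-factor argument. The short derivation below is a sketch written in the notation of the lemma (with $\mathbf{z}_{1,t} = \beta_{1,\epsilon}(\mathbf{g}_t)$ treated as a given function of time); it is added for the reader's convenience and is not part of the original proof.

```latex
% Multiply \dot{f}_t + f_t = z_{1,t} by the integrating factor e^{t}:
\frac{d}{dt}\big( e^{t} f_t(a_1) \big) = e^{t}\big( \dot{f}_t(a_1) + f_t(a_1) \big) = e^{t} z_{1,t}(a_1).
% Integrate from 0 to t:
e^{t} f_t(a_1) - f_0(a_1) = \int_0^t e^{t'} z_{1,t'}(a_1)\, dt'
\quad\Longrightarrow\quad
f_t(a_1) = e^{-t} f_0(a_1) + e^{-t} \int_0^t e^{t'} z_{1,t'}(a_1)\, dt'.
% If g_t = g is constant, then z_{1,t'} = \beta_{1,\epsilon}(g) and \int_0^t e^{t'} dt' = e^{t} - 1, so
f_t(a_1) = e^{-t} f_0(a_1) + (1 - e^{-t})\, \beta_{1,\epsilon}(g)(a_1).
```

The last line shows that the initial condition is forgotten at an exponential rate, which is why the solution approaches the Boltzmann-Gibbs response as time grows.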

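Lemma III-A.4 (Explicit Solutions of Replicator
Equation): Given P2's trajectory $\{\mathbf{g}_{t'}\}_{t'}$ and an interior
initial condition $\mathbf{f}_0$, the replicator equation in (14) has a
unique solution given by the vectorial function

$$\xi_1(\mathbf{g}_t)(a_1) = \frac{f_0(a_1)\, e^{\int_0^t u_1(\mathbf{e}_{a_1}, \mathbf{g}_{t'})\, dt'}}{\sum_{a_1' \in \mathcal{A}_1} f_0(a_1')\, e^{\int_0^t u_1(\mathbf{e}_{a_1'}, \mathbf{g}_{t'})\, dt'}},\ a_1 \in \mathcal{A}_1,$$

where the denominator is a normalization factor. In particular,
if P2 is a slow learner, i.e., $\mathbf{g}_t = \mathbf{g}$,
constant in time, then the replicator equation of P1 converges
to

$$\xi_1(\mathbf{g})(a_1) = \frac{f_0(a_1)\, e^{t u_1(\mathbf{e}_{a_1}, \mathbf{g})}}{\sum_{a_1' \in \mathcal{A}_1} f_0(a_1')\, e^{t u_1(\mathbf{e}_{a_1'}, \mathbf{g})}},\ a_1 \in \mathcal{A}_1.$$

Note that these solutions are in the interior of the simplex
for $t$ finite, but the trajectory can be arbitrarily close to the
boundary when $t$ goes to infinity. In particular, if we assume
that the other player is a slow learner, i.e., $\lambda_{2,t}/\lambda_{1,t} \to 0$, then

$$\xi_1(\mathbf{g})(a_1) \to \frac{f_0(a_1)}{\sum_{a_1' \in BR_1(\mathbf{g})} f_0(a_1')}\, \mathbb{1}_{\{a_1 \in BR_1(\mathbf{g})\}},$$

when $\epsilon \to 0$. The set $BR_1(\mathbf{g})$ denotes the set of pure
strategies of P1 that maximize $E_s \mathbb{U}(s, \mathbf{f}, \mathbf{g})$.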

Proposition III-A.5: Given any time-varying mixed
strategies $\{\mathbf{g}_t\}_t$, the explicit solution to the replicator
equation is $\xi_1(\mathbf{g}_t)(a_1) = \beta_{1,\frac{1}{t}}(V)(a_1)$, where $V$ is the payoff
vector defined by $V(a_1) = u_1(\mathbf{e}_{a_1}, \bar{\mathbf{g}}_t)$, where
$\bar{\mathbf{g}}_t = \frac{1}{t} \int_0^t \mathbf{g}_{t'}\, dt'$. In particular, if the time-average sequence $\bar{\mathbf{g}}_t$
converges to $\mathbf{g}^*$, then the explicit solution $\xi_1(\mathbf{g}_t)$ converges
to a smooth best response to $\mathbf{g}^*$.
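The link between the replicator solution and the soft-max map can be checked numerically in the simplest setting. The sketch below assumes a constant opponent strategy, so that the payoff vector `u` is constant in time, and uses hypothetical values throughout. It verifies two things: with a uniform initial condition the closed form equals the soft-max with temperature 1/t, and a centered finite difference of the closed form agrees with the replicator right-hand side.

```python
import numpy as np

def replicator_closed_form(f0, u, t):
    """Replicator solution at time t for constant payoffs u and initial point f0."""
    w = f0 * np.exp(t * (u - u.max()))   # the common factor cancels on normalization
    return w / w.sum()

def softmax(v, eps):
    w = np.exp((v - v.max()) / eps)
    return w / w.sum()

def replicator_rhs(f, u):
    """Replicator vector field: growth rate is payoff minus average payoff."""
    return f * (u - f @ u)

# Hypothetical data: payoffs of P1's two actions against a fixed strategy of P2.
u = np.array([-0.1, 0.4])
t, h = 3.0, 1e-5
uniform = np.array([0.5, 0.5])
interior = np.array([0.2, 0.8])

closed = replicator_closed_form(uniform, u, t)
finite_diff = (replicator_closed_form(interior, u, t + h)
               - replicator_closed_form(interior, u, t - h)) / (2 * h)
field = replicator_rhs(replicator_closed_form(interior, u, t), u)
```

For a non-uniform interior starting point the closed form is the soft-max weighted by the initial strategy, which mirrors the imitative Boltzmann-Gibbs mapping used by CRL2.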

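Theorem III-A.6 (Two Different Learners): Consider
two learners, one of which learns faster than the other.

(T1) Assume that P1 is a slow learner of RL2 or RL3 and
P2 is a fast learner of CRL1, i.e., $\lambda_{1,t}/\lambda_{2,t} \to 0$ as $t \to \infty$.
Then, almost surely, $\| \mathbf{g}_t - \xi_2(\mathbf{f}_t) \| \to 0$ as $t$ goes to infinity,
where $\xi_2(\mathbf{f}) = \beta_{2,\epsilon}(\mathbf{f})$, and

$$\dot{f}_t(a_1) = f_t(a_1) \Big[ u_1(\mathbf{e}_{a_1}, \beta_{2,\epsilon}(\mathbf{f}_t)) - \sum_{a_1' \in \mathcal{A}_1} f_t(a_1')\, u_1(\mathbf{e}_{a_1'}, \beta_{2,\epsilon}(\mathbf{f}_t)) \Big] \qquad (18)$$

generates the asymptotic pseudo-trajectory of $\{\mathbf{f}_t\}_{t \ge 0}$.

(T2) Assume that P2 is a slow learner of RL2 or RL3 and
P1 is a fast learner of CRL1, i.e., $\lambda_{2,t}/\lambda_{1,t} \to 0$ as $t \to \infty$.
Then, almost surely, $\| \mathbf{f}_t - \xi_1(\mathbf{g}_t) \| \to 0$ as $t$ goes to infinity,
where

$$\xi_1(\mathbf{g})(a_1) = \frac{e^{t u_1(\mathbf{e}_{a_1}, \mathbf{g})}}{\sum_{a_1' \in \mathcal{A}_1} e^{t u_1(\mathbf{e}_{a_1'}, \mathbf{g})}},\ a_1 \in \mathcal{A}_1,$$

and the ODE

$$\dot{\mathbf{g}}_t = \beta_{2,\epsilon}(\xi_1(\mathbf{g}_t)) - \mathbf{g}_t \qquad (19)$$

generates the asymptotic pseudo-trajectory of $\{\mathbf{g}_t\}_{t \ge 0}$.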

Note that this last ODE differs from the replicator dy- 
namics, the best response dynamics, the logit dynamics and 
fictitious play, etc. 

Remark III-A.7: Note that from Lemma III-A.3,
$\xi_1(\mathbf{g})(a_1) = \beta_{1,\frac{1}{t}}(\mathbf{g})(a_1)$. This means that if the trajectories
remain in the interior of the simplex, the time averages of the
replicator dynamics and the smooth best-response dynamics
are asymptotically close (the norm of the difference between
the two trajectories is small when $t$ is sufficiently large).
The mixed strategy $\beta_{1,\frac{1}{t}}$ has full support for any $t > 0$, i.e.,
$\xi_1(\mathbf{g})$ remains in the relative interior of the simplex for all
$t$.

The following theorem, whose proof can be found in the full
report [12], says that under CRL1, the dominated strategies
will be eliminated in the long term.

Theorem III-A.8: Consider algorithm CRL1. If a
strategy $a_1$ is strictly dominated, then $f_t(a_1) \to 0$ when
$t \to \infty$ and $\epsilon \to 0$.

B. Convergence to saddle points 

From (T1) of Theorem III-A.6, we see that the case
with P1 as the slow learner leads to the ODE in (18), whose
solution is given by Lemma III-A.4, which is in the form
of the smooth best response to P2. Knowing that $\mathbf{g}_t$ also
converges almost surely to the smooth best response to P1,
we conclude that the learning algorithm studied in (T1)
converges to an $\epsilon$-saddle point. Similarly, from (T2) of
Theorem III-A.6, when P1 acts as a fast learner, the ODE in
(19) has its solution given by Lemma III-A.3 and leads to the
smooth best response when $t \to \infty$. In addition, from (T1)
and from Proposition III-A.5, $\mathbf{f}_t$ converges to $\xi_1 = \beta_{1,\frac{1}{t}}$,
which is asymptotically close to the smooth best-response
dynamics. Hence we can conclude that the algorithm studied
in (T2) also converges to an $\epsilon$-saddle point. When $\epsilon$ goes to
zero, the stationary points of these heterogeneous dynamics
converge to the saddle points of the expected game. We
can extend the preceding argument to any combination of
replicator dynamics and smooth best response dynamics.
Using Theorem III-A.1 and its Corollary III-A.2, we arrive
at the following result.

Theorem III-B.1: Consider the case of two different
learners in which one learns faster than the other. Let the
initial condition be an interior point of the simplex. The
heterogeneous dynamics: (i) CRL0 with CRL1, (ii) CRL0
with CRL2, (iii) CRL1 with CRL2, (iv) CRL1 with RL2,
and (v) CRL1 with RL3 lead almost surely to an $\epsilon$-saddle
point of the expected game.

IV. Application and Simulation 

In this section, we illustrate the heterogeneous learning
algorithms with an example motivated by computer security.
In a network intrusion detection system, an intruder attempts
to scan the host machines and seek their vulnerabilities while
the intrusion detector monitors the suspicious behavior and
raises an alarm when attacks are detected. The attacker and
the defender can dynamically adapt their strategies from
learning the history of the behaviors of each other and
their own payoffs. It is common that the learning pattern of
the attacker is different from the one used by the defender,
since learning schemes depend on an individual's preference
and rationality as well as the information observed by
each person. Hence, in the context of computer security,
heterogeneity of the learning algorithm is essential because
it offers extra degrees of freedom to model an agent's behavior.

Fig. 1. The payoffs to the players with both players using CRL1.

Fig. 2. The mixed strategies of the players with both players using CRL1.

Fig. 3. The payoffs to the players with the attacker using CRL1 and the defender using RL2.

Fig. 4. The mixed strategies of the players with the attacker using CRL1 and the defender using RL2.

Consider a two-person game with one party being the
defender (P1) and the other party the attacker (P2). The
defender has two actions available for each play, i.e., either
to defend (D) or not to defend (ND), while the attacker has
two actions, either to attack (A) or not to attack (NA). The deterministic
payoff matrix is a 2 x 2 matrix $M$, where the
columns correspond to the defender strategies (D) and (ND)
whereas the rows correspond to the attacker strategies (A)
and (NA). The stochastic payoff matrix $U$ is a function of the
random matrix $S = \begin{bmatrix} s_1 & s_2 \\ s_3 & s_4 \end{bmatrix}$, whose components are
uniformly distributed on $[-1, 1]$. It is given by $U = M + S$.

At the equilibrium, the attacker selects its actions
according to $\mathbf{f}^* = [0.4, 0.6]^T$ while the defender chooses its actions
using $\mathbf{g}^* = [0.2, 0.8]^T$. The strategy pair $(\mathbf{f}^*, \mathbf{g}^*)$ forms a
saddle point solution to the game $E\,U = M$, yielding the
game value 2.6. We show in Figures 1 and 2 the payoffs
and the mixed strategies of the players, respectively, when
both adopt the CRL1 learning algorithm. With the chosen value of $\epsilon$,
we observe that the payoffs of P1 choosing actions A and
NA at $t = 8000$ are 2.5890 and 2.6073, respectively, which
are close to the game value 2.6. For P2, the payoffs at
$t = 8000$ are -2.6578 and -2.5855 for actions D and ND,
respectively. The difference between the payoff and the game
value is explained by the soft-max parameter $\epsilon$. When $\epsilon$
approaches 0, the average payoffs will approach the game
value. The convergence of CRL1 is slow. In Figures 1
and 2, we observe that the payoff values and the mixed
strategy probabilities converge roughly after $t = 6000$. In
Figures 3 and 4, we show the temporal evolution of the
payoffs and mixed strategies of the attacker and defender
using the heterogeneous learning algorithm in which the
attacker follows CRL1 whereas the defender uses RL2. We
initialize the payoffs and set the strategy vectors to $\mathbf{f}_0 =
[1/3, 2/3]^T$, $\mathbf{g}_0 = [1/3, 2/3]^T$. The parameter $\epsilon$ enters
the soft-max best response function of the attacker. The
convergence of the learning process is shown after t = 80s.
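A simulation of this kind can be sketched as follows. Everything specific in the code is an assumption made for illustration: the payoff matrix passed in is hypothetical (the matrix used in the paper is not reproduced here), the payoff estimates start at zero, the step sizes are one admissible choice with the payoff estimates learning on the faster time scale, and both players run a CRL1-style update with c = 0. The initial strategies [1/3, 2/3] follow the text.

```python
import numpy as np

def softmax(v, eps):
    w = np.exp((v - v.max()) / eps)
    return w / w.sum()

def simulate_crl1(M, eps=0.1, steps=5000, seed=0):
    """Both players run a CRL1-style update on a noisy 2x2 zero-sum game (c = 0)."""
    rng = np.random.default_rng(seed)
    f, g = np.array([1 / 3, 2 / 3]), np.array([1 / 3, 2 / 3])
    u1, u2 = np.zeros(2), np.zeros(2)            # assumed initial payoff estimates
    for t in range(1, steps + 1):
        lam = 1.0 / (t + 1)                      # strategy learning rate (hypothetical)
        mu = 1.0 / (t + 1) ** 0.6                # payoff learning rate, lam / mu -> 0
        a1, a2 = rng.choice(2, p=f), rng.choice(2, p=g)
        payoff = M[a1, a2] + rng.uniform(-1.0, 1.0)   # one realized entry of U = M + S
        u1[a1] += mu * (payoff - u1[a1])         # P1 averages its own perceived payoff
        u2[a2] += mu * (-payoff - u2[a2])        # P2 perceives the negative payoff
        f = (1 - lam) * f + lam * softmax(u1, eps)
        g = (1 - lam) * g + lam * softmax(u2, eps)
    return f, g, u1, u2

# Hypothetical matrix: rows are P1's actions, columns are P2's actions.
M_demo = np.array([[2.0, -1.0], [-1.0, 1.0]])
f_T, g_T, u1_T, u2_T = simulate_crl1(M_demo, steps=2000)
```

Each player only uses its own action and its own perceived payoff, which matches the information structure assumed throughout. Replacing the soft-max target of one player with a payoff-weighted move toward the played action would give a heterogeneous CRL1 and RL2 pairing, provided payoffs are rescaled so that the step-size condition of RL2 holds.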

V. Concluding remarks 

We have presented heterogeneous distributed learning al- 
gorithms for two-person zero-sum stochastic games along 
with their general convergence and non-convergence prop- 
erties. Our results subsume many known results regarding 
learning optimal strategies with different time scales and with 
different learning schemes. Interesting work that we leave for 
the future is to extend these results to stochastic games with 
controlled states and nonzero-sum stochastic games with 
incomplete information.